Lab 00: Week 1
The specific aims of this lab are:
- meet some of your fellow students (there will be group work later in semester),
- refresh your knowledge of R and RStudio (or introduce you to R and RStudio if this is your first time),
- understand projects and packages in R (especially the tidyverse suite of packages),
- generate a Quarto (or R Markdown) document, and
- familiarise you with some additional resources and how to go about getting help.
The unit learning outcomes addressed are:
- LO2 Extract and combine data from multiple data resources.
- LO3 Construct, interpret and compare numerical and graphical summaries of different data types including large and/or complex data sets.
- LO8 Create a reproducible report to communicate outcomes using a programming language.
1 Getting to know each other
Your tutor will lead you through a getting to know each other exercise.
2 R and RStudio
The program/language we will be using to analyse data this semester is called R. We will mostly access R through the IDE¹ RStudio. Both are free to use.
You will need to install (or upgrade to the latest version if you already have them installed):
- latest version of R (v4.4.1 or later), and
- latest version of RStudio (v2024.04.2 or later).
2.1 Packages
When working in R, there are some functions and data sets that are always available, but the real strength of R comes from its community of developers who continually improve the set of available features and add additional functionality through an ecosystem of “packages”.
A collection of packages, mostly backed by RStudio, called the tidyverse has attracted a lot of attention in the statistics and data science sphere (Wickham et al., 2019). You can install the entire suite of tidyverse packages using the command
install.packages("tidyverse")
On Windows you might get a warning message about needing Rtools, for example something like this:
> install.packages("tidyverse")
WARNING: Rtools is required to build R packages but is not currently installed.
Please download and install the appropriate version of Rtools before proceeding:
https://cran.rstudio.com/bin/windows/Rtools/
This is just a warning (not an error) and most of the time R will still go ahead and install the most recent pre-compiled binary that it can find from CRAN. You can install Rtools if you really want but it’s pretty big and only really needed if you’re a developer or want to compile packages from source.
Running install.packages("tidyverse") will install ggplot2 (graphics), dplyr (data manipulation), readr (importing data) and a whole slew of other useful packages (Wickham, 2016; Wickham et al., 2017, 2018). You only need to use install.packages() the first time. When you actually want to use the packages, you need to load them into the environment:
library("tidyverse")
When you load the tidyverse master package using library("tidyverse"), R goes and loads a bunch of other packages. It also prints out a few things, telling us which packages were loaded (and which versions) and that some functions that were previously available have now been masked by the newly loaded packages. For example, if you wanted to use the filter() function from the stats package, you now need to use stats::filter().
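To make the masking concrete, here is a minimal sketch. The toy data frame `small` and the other object names are made up for illustration. Once dplyr is loaded, a bare filter() call finds dplyr's version, while the stats version is still reachable through its package prefix.

```r
library("dplyr")

# A made-up data frame, purely for illustration
small = data.frame(x = 1:5)

# With dplyr loaded, filter() refers to dplyr::filter(): keep rows where x > 3
kept = small |> filter(x > 3)

# The masked function is still available via the package prefix.
# stats::filter() applies a linear filter, here a two-point moving average
smoothed = stats::filter(small$x, rep(1 / 2, 2))

nrow(kept)        # number of rows that satisfied the condition
length(smoothed)  # same length as the input series
```

The two functions do completely different jobs, which is why it matters to know which one R will pick up.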
You could also load each package individually, e.g. library("ggplot2"), library("dplyr") or library("readr").
For an overview of the functionality in the tidyverse see R for Data Science (Wickham et al., 2022).
2.2 Palmer Archipelago penguins
We’re going to dive in the deep end. We’re going to install a package called palmerpenguins that contains a neat data set for us to experiment with (Horst et al., 2020).
Let’s install it:
install.packages("palmerpenguins")
If all went well it’s now installed on your computer (which you only need to do once), but it’s not currently loaded (meaning the functionality is not yet available). We load the package using the library() function. To help think about this, installing the package adds it to our library collection but when we do that, the package is stored on the shelf in the library and not really accessible. To actually use the package we need to take it off the shelf and check it out of the library, which we do using the library() function:
library("palmerpenguins")
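If you are unsure whether a package is already on the shelf, one way to check is base R's requireNamespace(), which returns TRUE or FALSE without attaching the package. The helper name is_installed() below is made up for this sketch.

```r
# Hypothetical helper: TRUE if the package can be found, FALSE otherwise
is_installed = function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")            # stats ships with R
is_installed("palmerpenguins")   # depends on what you have installed
```

A check like this is handy before calling install.packages(), so you do not reinstall something you already have.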
What does this package do? We can see the help page using ? or help():
?palmerpenguins
# help(palmerpenguins)
Most packages bundle up a set of functions and make them available to the user when the package is loaded. The palmerpenguins package is a little unusual in that it doesn’t provide any functions, just two data sets, penguins and penguins_raw. We will start with the raw data in penguins_raw. We can find out a bit more about it using the help:
?penguins_raw
When the package is loaded, the data is invisibly available (i.e. it doesn’t show up in the Global Environment) until we use it for the first time. We can get an overview of the structure of the stored data using the glimpse() function from the dplyr package:
glimpse(penguins_raw)
When you use glimpse(), it shows one line for each column in the data frame, with the variable name, as well as what type of variable R thinks the column is. Can you work out what each of them means? Do they all make sense?
Before we go any further, we need to notice that the variable names of penguins_raw do not lend themselves to easy use for coding. Specifically, spaces are tricky to deal with and special characters like parentheses or slashes aren’t great to have in a variable name. We can fix this using one of my favourite packages, the janitor package. If it’s your first time using the janitor package, you need to start with installing it:
install.packages("janitor")
The janitor package has an incredibly useful function called clean_names() that sensibly sanitises column names to make it easier for subsequent analysis.
old_names = colnames(penguins_raw)
penguins = penguins_raw |>
  janitor::clean_names()
We stored the old names in old_names. Create a new variable called new_names with the clean column names and compare the old names and the “cleaned” names side by side using the bind_cols() function from the dplyr package. Discuss the changes that have been made to the column names.
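To see what clean_names() does on a smaller scale before tackling the exercise, here is a sketch using a made-up two-column data frame (the object names `toy` and `toy_clean` and the values are invented for illustration, not taken from the penguin data).

```r
library("janitor")

# A made-up data frame with awkward column names (not the real penguin data)
toy = data.frame(
  `Body Mass (g)` = c(3750, 3800),
  `Date Egg` = c("2007-11-11", "2007-11-16"),
  check.names = FALSE  # keep the spaces and parentheses as typed
)

toy_clean = clean_names(toy)

colnames(toy)        # original names
colnames(toy_clean)  # "body_mass_g" "date_egg"
```

Spaces become underscores, capital letters become lower case, and the parentheses are dropped, which leaves names that can be typed without back ticks.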
Let’s visualise some of the data using the ggplot2 package. To make use of the ggplot2 package, you need to install it (you probably already have, it comes when you install the tidyverse) and then load it (you may also have already done this if you loaded the tidyverse above).
Your tutor will walk you through the details. The code below generates Figure 1, a scatter plot of flipper length against body mass, with the points coloured by species.
library("ggplot2")
penguins |>
  ggplot() +
  # add the aesthetics
  aes(x = body_mass_g,
      y = flipper_length_mm,
      colour = species) +
  # add a geometry
  geom_point() +
  # tidy up the labels
  labs(x = "Body mass (g)",
       y = "Flipper length (mm)",
       colour = "Species")
In Figure 1 the species labels are a bit long. We really only need to keep the first word, so let’s do that using the word() function from the stringr package (also part of the tidyverse). To do this, we overwrite the species column in the penguins data frame using the mutate() function from the dplyr package.
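One way to write that step is sketched below. It assumes the palmerpenguins, janitor, dplyr and stringr packages are installed, and it recreates the cleaned penguins data frame first so that the block runs on its own. Treat it as one reasonable version rather than the only way to do it.

```r
library("dplyr")
library("stringr")
library("palmerpenguins")

# Recreate the cleaned data frame from earlier so this block runs on its own
penguins = penguins_raw |>
  janitor::clean_names()

# Keep only the first word of each species label
penguins = penguins |>
  mutate(species = word(species, start = 1, end = 1))

unique(penguins$species)
```

Because the result is assigned back to penguins, the long labels are gone for good. Rerun the clean_names() step if you ever need them back.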
Now regenerate the plot.
Let’s save that plot as a png file so you can print it out and stick it on the fridge!
ggsave(filename = "myfirstggplot.png")
The ggplot2 cheat sheet is a great, concise resource to find out some of what’s possible. You can also access this from within RStudio by clicking the menu item Help > Cheatsheets.
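A note on ggsave(): by default it saves the most recently displayed plot, at the size of the current graphics device. A sketch of a more explicit call follows. The built-in mtcars data, the object name `p` and the temporary file path are stand-ins chosen for illustration.

```r
library("ggplot2")

# A stand-in plot using a data set that ships with R
p = ggplot(mtcars) +
  aes(x = wt, y = mpg) +
  geom_point()

# Write to a temporary folder so nothing in your project is overwritten
out_file = file.path(tempdir(), "example_plot.png")

# Name the plot explicitly and fix the size (in inches) and resolution
ggsave(filename = out_file, plot = p, width = 6, height = 4, dpi = 300)

file.exists(out_file)
```

Passing the plot object and dimensions explicitly means the saved file does not depend on what happened to be on screen last.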
2.2.1 Exercises
- Generate a scatter plot for another pair of (numeric) variables.
- Colour by sex and use facet_wrap() to generate a plot for each species and island combination.
- Try including a line of best fit by adding another geometry layer geom_smooth(method = "lm").
- Use a different geometry, geom_histogram(), to create a histogram for flipper length, coloured by species.
- Save an updated version of your plot using ggsave().
- Try outputting the data to a CSV file using the write_csv() function which can be found in the readr package.
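For the last exercise, the general pattern of write_csv() can be tried on a toy example first. The data frame below is made up for illustration and the file goes to a temporary folder. Reading it back with read_csv() is an easy way to confirm the file was written as expected.

```r
library("readr")

# A made-up data frame standing in for your own data
toy = data.frame(
  species = c("Adelie", "Gentoo"),
  body_mass_g = c(3750, 5000)
)

# Write to a temporary folder, then read the file back in
csv_file = file.path(tempdir(), "toy_penguins.csv")
write_csv(toy, csv_file)
toy_back = read_csv(csv_file, show_col_types = FALSE)

toy_back
```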
2.2.2 Advanced: interactivity
Just for fun, let’s make it interactive using the plotly package (Sievert et al., 2017).
# install.packages("plotly")
library("plotly")
myplot = penguins |>
  ggplot() +
  # add the aesthetics
  aes(x = body_mass_g,
      y = flipper_length_mm,
      colour = species) +
  # add a geometry
  geom_point() +
  # tidy up the labels
  labs(x = "Body mass (g)",
       y = "Flipper length (mm)",
       colour = "Species")
plotly::ggplotly(myplot)
2.3 Reproducible reporting
Markdown is a lightweight markup language (in the same way that HTML is a markup language). One of the big advantages of markdown as a language is its simplicity: it forces you to focus on content rather than play with styling. It’s quite straightforward to integrate your R code with markdown, which is a great way to do reproducible research and generate reports. There are two main ways of integrating markdown with your code in R.
The first is R Markdown, which has been around for many years. You can compile (or knit) R Markdown documents into a variety of formats, including HTML, Word, PDF, as well as presentations. For more details on using R Markdown see http://rmarkdown.rstudio.com. A useful guide to help you get started can be found here and there’s a cheat sheet here. There is also a book on R Markdown which has everything you could possibly want to know about R Markdown and a whole lot more (Xie et al., 2018), and R Markdown for Scientists, which gives a more concise overview.
The more recent approach is to use Quarto. Quarto was developed by the people who created R Markdown as a way to achieve something similar to R Markdown without necessarily requiring R. When used with R it is very similar to R Markdown, but it can also be used with Python, Julia, and Observable JS. While R Markdown is an R package, Quarto is a standalone program (it comes bundled with recent versions of RStudio). You can start a new Quarto document in RStudio by going to File > New File > Quarto Document.... If you don’t see this in the menu, check that you have the most recent version of RStudio installed.
Quarto was developed by the same team behind RStudio and has excellent support in the latest versions of RStudio. R Markdown is still very widely used and isn’t going anywhere, but the future of reproducible reporting with R (and likely other languages) is Quarto.
A key difference between Quarto and R Markdown is the way the YAML is specified, see the examples below. By default Quarto HTML generated files aren’t “self-contained” so if you share your HTML you’d need to share the folder that goes along with it. To avoid this, it’s always a good idea to specify embed-resources: true in the YAML - this is particularly important when it comes time to submit your HTML file in Canvas (and why it’s important to check what you’ve uploaded).
Think of Quarto as the next generation of R Markdown.
If you have a lot of legacy R Markdown documents, it might make sense to stay with R Markdown, but for people just starting out in their data science journey it’s a good idea to work with the latest tools.
2.3.1 Super brief overview
- Create a new qmd file (qmd is the Quarto file extension). In RStudio: File > New File > Quarto Document...
- When you have a qmd file open in RStudio there’s a Render button up the top of the source window. You click that button to turn the markdown into HTML (the default).
- Text and R code can be combined in the qmd file. Code chunks begin with three back ticks followed by r, the (optional) chunk name and any arguments: ```{r} or ```{r chunk_name, tidy = TRUE}. The chunk also ends with three back ticks ```. Examples can be seen in the template that opens as a new file in RStudio (you can delete most of the template except the YAML code at the top).
- There are lots of YAML options, see the documentation for details.
In Quarto, chunk options can also be included on their own line inside the chunk.
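As a sketch of that last point, the lines below show what the inside of a Quarto R chunk might look like, that is, the part between the opening {r} line and the closing back ticks. Options go on lines starting with #| and use Quarto's hyphenated names (fig-width rather than fig.width). The label and the option values here are arbitrary choices for illustration.

```r
#| label: example-chunk
#| fig-width: 4
#| fig-height: 6
#| warning: false
#| message: false

# Ordinary R code follows the option lines
x = 1:10
plot(x, x^2)
```

To R itself the #| lines are just comments, so the same code still runs unchanged if you paste it into the console.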
Example YAML:
---
title: "My awesome report"
date: "YYYY-MM-DD"
author: "Your name here"
format:
  html:
    ### IMPORTANT ###
    embed-resources: true # Creates a single HTML file as output
    code-fold: show # Code folding; allows you to show/hide code chunks
    ### USEFUL ###
    code-tools: true # Includes a menu to download the code file
    ### OPTIONAL ###
    code-line-numbers: true # Line numbers in code chunks
    df-print: paged # Sets how dataframes are automatically printed
    theme: lux # Controls the font, colours, etc.
    table-of-contents: true # (Useful) Creates a table of contents!
    number-sections: true # (Optional) Puts numbers next to heading/subheadings
---
- Create a new Rmd file (Rmd is the R Markdown file extension). In RStudio: File > New File > R Markdown...
- When you have an Rmd file open in RStudio there’s a Knit button up the top of the source window. You click that button to turn the markdown into HTML (or PDF or Word).
- Text and R code can be combined in the Rmd file. Code chunks begin with three back ticks followed by r, the (optional) chunk name and any arguments: ```{r} or ```{r chunk_name, tidy = TRUE}. The chunk also ends with three back ticks ```. Examples can be seen in the template that opens as a new file in RStudio (you can delete most of the template except the YAML code at the top).
Example YAML:
---
title: "My awesome report"
date: "YYYY-MM-DD"
author: "Your name here"
output:
  html_document:
    ### IMPORTANT ###
    # self_contained: true # Creates a single HTML file as output
    code_folding: show # Code folding; allows you to show/hide code chunks
    ### USEFUL ###
    code_download: true # Includes a menu to download the code file
    ### OPTIONAL ###
    df_print: paged # Sets how dataframes are automatically printed
    theme: readable # Controls the font, colours, etc.
    toc: true # (Useful) Creates a table of contents!
    toc_float: true # table of contents at the side
    number_sections: false # (Optional) Puts numbers next to heading/subheadings
---
2.3.2 Including plots
You can embed static plots in an R Markdown document without doing anything special. Important chunk options are fig.width and fig.height, which set the figure width and height, for example ```{r, fig.width = 4, fig.height = 6}.
2.3.3 Chunk options
Some useful chunk options:
- tidy = TRUE makes the R code more readable (proper spacing)
- results = 'hide' hides the results of the chunk output (i.e. don’t show them)
- results = 'hold' holds the results of the chunk output until all commands in the chunk have been run
- warning = FALSE don’t show any warning messages (e.g. when ggplot2 drops observations)
- message = FALSE don’t show any messages (e.g. when packages load)
- {r chunkname} you can name your chunks with text immediately after the r. This can be particularly useful when errors pop up as it makes it easier to identify which chunk the error occurs in.
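Rather than repeating the same options in every chunk, they can be set once for the whole document. A minimal sketch, assuming the knitr package is installed (it is what runs the R chunks in both R Markdown and Quarto documents). This call would typically sit in a chunk near the top of the source file, and the particular values are just examples.

```r
# Set default chunk options for all chunks that follow
knitr::opts_chunk$set(
  warning = FALSE,  # hide warnings
  message = FALSE,  # hide messages, e.g. from loading packages
  fig.width = 4,
  fig.height = 6
)

# Check what the current default is
knitr::opts_chunk$get("warning")
```

Options given on an individual chunk still override these defaults for that chunk.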
2.3.4 Exercise
Take the work you did with the Palmer penguins data and write it up in an R Markdown or Quarto document. Detail what you did, including the packages and functions you used, in the text for future you. Compile (knit/render) to HTML.
When you do this, you’ll find that each time you compile your document, it re-runs all your code and loads all the libraries from scratch. This is a) a pain and b) fantastic for reproducibility. It’s a pain because you’ve already done things in the “global” environment, loaded data and packages, generated figures, etc, and it takes time for things to be re-run. It’s fantastic for reproducibility because it means that everything you do has to be in the source Rmd or qmd file for the compilation to be successful. I.e. you’ll need to load all the packages you use in the source file, you’ll need to do all the data manipulation, and include all the plot code in the source file.
3 Test submission
Later in semester you will need to submit an R Markdown/Quarto report (the first assignment). To help familiarise you with this process, we strongly recommend you try submitting your compiled HTML report to the Week 1 practice submission assignment on Canvas. We want to make sure you’re familiar with the process of uploading an HTML file.
You should always double check to make sure it has actually been submitted and looks the same as you’re expecting (see comments above about making a “self-contained” HTML file using embed-resources: true in Quarto).
There are no marks associated with Week 1 practice submission.
4 Review questions
- What does ? followed by a function or package name do?
- Why would you use a # in your R code?
- What’s the difference between install.packages("palmerpenguins") and library("palmerpenguins")?
- How can you check if a package is installed on your computer?
- What’s the difference between a warning, an error and a message?
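For the last question, it can help to trigger each kind of condition yourself using the base R functions message(), warning() and stop(). The function name report() and the wording of the conditions below are made up for this sketch.

```r
# Hypothetical function that signals a message and a warning, then returns
report = function() {
  message("message: for information only")
  warning("warning: something may be wrong, but the code carries on")
  "finished"
}

# Both conditions are reported, yet the function still returns its value
value = report()

# An error halts execution, so we catch it here to keep the script running
caught = tryCatch(
  stop("error: cannot continue"),
  error = function(e) conditionMessage(e)
)

value
caught
```

Try running stop() outside of tryCatch() to see how an uncaught error interrupts your code.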
Footnotes
¹ Integrated development environment